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Abstract 

We consider a Bayesian approach to model selection in Gaussian 
linear regression, where the number of predictors might be much larger 
than the number of observations. From a frequentist view, the pro- 
posed procedure results in the penalized least squares estimation with 
a complexity penalty associated with a prior on the model size. We in- 
vestigate the optimality properties of the resulting model selector. Wc 
establish the oracle inequality and specify conditions on the prior that 
imply its asymptotic minimaxity within a wide range of sparse and 
dense settings for "nearly-orthogonal" and "multicollinear" designs. 



1 Introduction 

Consider the standard Gaussian linear regression model 



y = X/3 + e, 



where y € M. n is a vector of the observed response variable Y, X nxp is the 
design matrix of the p explanatory variables (predictors) X±, ...,X p , (3 £ MP 
is a vector of unknown regression coefficients, e ~ iV(0, cr 2 I n ) and the noise 
variance a 2 is assumed to be known. 

A variety of statistical applications of regression models involves a vast 
number of potential explanatory variables that might be even large relatively 
to the amount of available data. It raises a severe "curse of dimensionality" 
problem. Reducing dimensionality of the model becomes therefore crucial in 
the analysis of such large data sets. The goal of model (or variable) selection 
is to select the "best", parsimonious subset of predictors. The corresponding 
coefficients are then usually estimated by least squares. The meaning of the 
"best" subset however depends on the particular aim at hand. One should 
distinguish, for example, between estimation of regression coefficients f3, 
estimation of the mean vector X(3, model identification and predicting future 
observations. Different aims may lead to different optimal model selection 
procedures especially when the number of potential predictors p might be 
much larger than the sample size n. In this paper we focus on estimating the 
mean vector X(3 and the goodness of a model (subset of predictors) M is 
measured by the quadratic risk E\ \ X/3 M -X(3\ \ 2 = \ \X/3 M -X(3\ \ 2 + a 2 \M\ , 
where (3 M is the least squares estimate of (3 and X(3 M is the projection of X{3 
on the span of M. The first (bias) term of the risk decomposition represents 
the approximation error of the projection, while the second (variance) term 
is the price for estimating the projection coefficients (3 M by (3 M and is 
proportional to the model size. The "best" model then is the one with the 
minimal quadratic risk. Note that the true underlying model in (TjQ) is not 
necessarily the best in this sense since sometimes it is possible to reduce its 
risk by excluding predictors with small (but still nonzero!) coefficients. 

Such a criterion for model selection is obviously impossible to imple- 
ment since it depends on the unknown f3. Instead, the corresponding ideal 
minimal risk can be used as a benchmark for any available model selection 



procedure. The model selection criteria are typically based on the empirical 



direct minimization of the empirical risk evidently leads to a trivial (unsat- 
isfactory!) choice of the saturated model. A typical remedy is then to add 
a complexity penalty Pen(\M\) that increases with the model size, and to 
consider penalized least squares criterion of the form 



The properties of the resulting estimator depends on the proper choice 
of the complexity penalty function Pen(-) in ([2]). There exists a plethora 
of works in literature on this problem. The standard, most commonly used 
choice is a linear type penalty of the form Pen(k) = 2a 2 Xk for some fixed A > 
0. The most known examples motivated by different ideas include AIC for 
A = 1 (Akaike, 1974), BIC for A = (lnn)/2 (Schwarz, 1978) and RIC for A = 
Inp (Foster h George, 1994). A series of recent works suggested the so-called 
2/uTn(p//c)-type nonlinear penalties of the form Pen(k) = 2a 2 ck(ln(p/k) + 
Cp,k), where c > 1 and Cp,fc is some "negligible" term (see, e.g., Birge & 
Massart, 2001, 2007; Johnstone, 2002; Abramovich et al, 2006; Bunea, 
Tsybakov & Wegkamp, 2007). 

In this paper we present a Bayesian formalism to the model selection 
problem in Gaussian linear regression (TjQ) that leads to a general penalized 
model selection rule ([2]). The proposed Bayesian approach can be used, 
in fact, as a natural tool for obtaining a variety of penalized least squares 
estimators with different complexity penalties that accommodate many of 
the known model selection procedures as particular cases corresponding to 
specific choices of the prior. Within Bayesian framework, the penalty term 
in (|2|) is interpreted as proportional to the logarithm of a prior distribution. 
Complexity penalties Pen(\M\) imply placing a prior on the model size 
(the number of nonzero entries of /3). Minimization of (J2j) corresponds to 
the maximum a posteriori (MAP) rule yielding the resulting MAP model 
selector to be the posterior mode. 



quadratic risk | |y — X(3 M 



2 , which is essentially the least squares. However, 




M 



Although there exists a large amount of literature on Bayesian model 
selection (see George & McCulloch, 1993, 1997; Chipman, George k, Mc- 
Cullogh, 2001; Liang et al, 2008 for surveys), it mainly focuses on "purely 
Bayesian" issues (e.g., prior specification, posterior calculations, etc.) and 
does not investigate the optimality of the resulting Bayesian procedures from 
a frequentist view. In this paper we study the optimality properties of the 
proposed MAP model selectors for estimating the mean vector X(3 in ([1]). 
First, under mild conditions on the prior we establish the oracle inequal- 
ity and show that, up to a constant multiplier, they achieve the minimal 
possible risk among all estimators. We then investigate their asymptotic 
minimaxity. For "nearly-orthogonal" design they are proved to be simulta- 
neously rate-optimal (in the minimax sense) over a wide range of sparse and 
dense settings and outperform various existing model selection procedures, 
e.g. AIC, BIC, RIC, Lasso (Tibshirani, 1996) and Dantzig selector (Candes 
& Tao, 2007). In a way, these results extend those of Abramovich, Grin- 
shtein & Pensky (2007) and Abramovich et al. (2010) for the normal means 
problem corresponding to the particular case X = I n . 

The analysis of "multicollinear" design, which is especially relevant for 
"p much larger than n" setup, is more delicate. We demonstrate that the 
lower bounds for the minimax rates for estimating the mean vector in this 
case are smaller than those for "nearly-orthogonal" design by the factor 
depending on the design properties. Such "blessing of multicollinearity" can 
be explained by a possibility of exploiting correlations between predictors to 
reduce the size of a model (hence, to decrease the variance) without paying 
much extra price in the bias term. We show that under some additional 
assumptions on the design and the coefficients vector /3 in (|Tj). the proposed 
Bayesian model selectors are still asymptotically rate-optimal. 

The paper is organized as follows. The Bayesian model selection proce- 
dure that leads to a penalized least squares estimator ([2j) is introduced in 
Section [2j In Section [3] we derive an upper bound for the quadratic risk of 



the resulting MAP model selector, compare it with that of an oracle and find 
the conditions on the prior where, up to a constant multiplier, it achieves the 
minimal possible risk among all estimators. In Section 0] we obtain the up- 
per and lower risk bounds of the MAP model selector in a sparse setup that 
allows us to investigate its asymptotic minimaxity for nearly-orthogonal and 
multicollinear designs in Section [5j The computational aspects are discussed 
in Section [6l and the main take-away messages of the paper are summarized 
in Section [7J All the proofs are given in the Appendix. 

2 MAP model selection procedure 

Consider the Gaussian linear regression model ([1]), where the number of 
possible predictors p might be even larger then the number of observations 
n. Let r = rank(X)(< min(p, n)) and assume that any r columns of X 
are linearly independent. For the "standard" linear regression setup, where 
all p predictors are linearly independent and there are at least p linearly 
independent design points, r = p. 

Any model M is uniquely defined by the px p diagonal indicator matrix 
Dm = diag(d.M), where dju = ^{Xj G M} and, therefore, \M\ = tr(D^f). 
The corresponding least square estimate (3 M = (DmX' XDm) + DmX'y, 
where "+" denotes the generalized inverse matrix. 

Assume some prior on the model size ir(k) = P(\M\ = k) , where ir{k) > 
0, k = 0, r (k = corresponds to a null model with a single intercept) and 
n(k) = for k > r since otherwise, there necessarily exists another vector 
(3* with at most r nonzero entries also satisfying ([1]), that is, X(3 = X(3*. 

For any k = 0, ...,r — 1 there are (jf) different models of a given size k. 
Assume all of them to be equally likely, that is, conditionally on \M\ = k, 

P{M | \M\ =k) = 

One should be a little bit more careful for k = r = rank(X). Although 
there are different sets of predictors of size r, all of them evidently result 




in the same estimator for the mean vector and, in this sense, are essentially 
undistinguishable and associated with a single (saturated) model. Hence, in 
this case, we set 

P(M | \M\ = r) = 1 (3) 

Finally, assume the normal prior on the unknown vector of k coefficients 
of the model M: (3 M ~ N p (0^a 2 (D M X' XD M ) + )- This is a well-known 
conventional g-prior of Zellner (1986). 

For the proposed hierarchical prior, straightforward calculus yields the 
posterior probability of a model M of size \M\ = 0, ...,r — 1: 

P{M\y) oc n\M\)[ ) (1+7) 2 exp J 



\M\J v " * 17 + 1 2a 2 

(4) 

Finding the most likely model leads therefore to the following maximum a 
posteriori (MAP) model selection criterion: 

y'l%(%l'l%) + %X'y+2^(l+l/7) In j (j^j vr(|Af|)(l + 7 )~^ 
or, equivalently, 

||y - Xp M \\ 2 + 2a 2 (l + l/ 7 ) In ( ( . f^. ) 7r(|ikf I)" 1 (1 + 7)^ ) ^ ™, (5) 



which is of the general type ([2]) with the complexity penalty 

Pen(k) = 2<T 2 (l + l/7)ln|Qvr(A;)- 1 (l + 7)l|, fc = 0,...,r-l (6) 

Similarly, for |M| = r from ([3]) one has 

Pen(r) = 2a 2 (l + I/7) In {7r(r) -1 (l + 7)3} (7) 

A specific form of the penalty (J!])-© depends on the choice of a prior 
7r(-). In particular, the (truncated if p > n) binomial prior B(p,£) cor- 
responds to the prior assumption that the indicators djM are indepen- 
dent. The binomial prior yields the linear penalty Pen(k) = 2a 2 Xk, where 



A = (1 + l/ 7 ) ln{^/TT7(l - £)/£} ~ ln{\/7(! - £)/£} for sufficiently large 
variance ratio 7. The AIC criterion corresponds then to £ ~ y/l/( e + \/7)) 
while £ ~ y/l/ip + \/7) leads to the RIC criterion. These relations indi- 
cate that RIC should be appropriate for sparse cases, where the size of the 
true (unknown) model is believed to be much less than the number of pos- 
sible predictors, while AIC is suitable for dense cases, where they are of the 
same order. In fact, any binomial prior or, equivalently, any linear penalty 
cannot "kill two birds with one stone". On the other hand, the (trun- 
cated) geometric prior n(k) oc q k , k = l,...,r for some < q < 1, implies 
Pen(k) ~ 2<j 2 (l + l/j)k(ln(p/k) + c(j,q)) which is of the 2k ln{p/ k)-type 
introduced above. For large 7 it behaves similar to RIC for k <C p and to 
AIC for k ~ p and is, therefore, adaptive to both sparse and dense cases. 
We will discuss these issues more rigorously in Section below. 



3 Oracle inequality 

In this section we derive an upper bound for the quadratic risk of the pro- 
posed MAP model selector and compare it with the ideal minimal quadratic 
risk often called in literature as an oracle risk. 

Assumption (P). Assume that 

7r(*0 < Q e " C(7)fc > k = °> ■■■>*■- 1, and 7r(r) < e" c(7)r 
where c( 7 ) = 8(7 + 3/4) 2 > 9/2. 

Assumption (P) is not restrictive. Indeed, the obvious inequality (?) > 
(p/k) k implies that for k < r it automatically holds for any prior ir(k) for all 
k < pe~ c ^\ Assumption (P) is used to establish an upper bound for the 
quadratic risk of the MAP model selector. 

Theorem 1. Let the model M be the solution of (0) with the complexity 
penalty Pen(-) given in (G|)-([?P and 0^ be the corresponding least squares 



estimate. Then, under Assumption (P) 

^|X^-X/3|| 2 < C o(7)mf{||X/3 M -X/3|| 2 + P e n(|M|)}+ Cl (7y 2 (8) 

for some 00(7) and 01(7) depending only on 7. 

To assess the quality of the upper bound in ([8]), we compare it with the 
oracle risk infjw E\ \X/3 M — X/3|| 2 . Note that the oracle risk is exactly zero 
when j3 = and, evidently, no estimator can achieve it in this case. Hence, 
an additional, typically negligible term a 2 , which is, essentially, an error of 
estimating a single extra parameter, is usually added to the oracle risk for a 
proper comparison. It is known that no estimator can attain a risk smaller 
than within 2\np factor from that of an oracle (e.g., Foster & George, 1994; 
Donoho & Johnstone, 1995; Candes, 2006). The following theorem shows 
that under certain additional conditions on the prior 7r(-), the resulting MAP 
model selector achieves this minimal possible risk among all estimators up 
to a constant multiplier depending on 7: 

Theorem 2 (oracle inequality). Let Tr(k) satisfy Assumption (P) and, in 
addition, 7r(0) > p~ c , 7r(fe) > p~ ck , k = 1, ...,r for some constant c > 0. 
Then, the resulting MAP model selector satisfies 

E\\XPa - X(3\\ 2 < c 2 (j)\np (mfE\\X$ M - X/3\\ 2 + a 2 ) 

for some 02(7) > 1. 

In particular, it can be easily shown that Theorem [2] holds for the (trun- 
cated) binomial B(p,£) with £ = 1/p (RIC criterion) and geometric priors 
(see Section[2]). More generally, all priors such that lnir(k) = 0(k\n(k/p)) 
corresponding to the 2fcln(p/A;)-type penalties satisfy the conditions of The- 
orem [2j 

4 Risk bounds for sparse settings 

In the previous section we considered the global behavior of the MAP esti- 
mator without any restrictions on the model size. However, in the analysis 



of large data sets, it is typically reasonable to assume that the true model 
in ([1]) is sparse in the sense that only part of coefficients in f3 are different 
from zero. We now show that under such extra sparsity assumption, more 
can be said on the optimality of the MAP model selection. 

For a given 1 < po < r, define the sets of models -M Po that have at most 
Po predictors, that is, Ai po = {M : \M\ < po}. Obviously, if a true model 
in ([1]) belongs to M Po , the Iq quasi- norm of the corresponding coefficients 
vector ||/3||o < Po, where \\(3\\o is the number of its nonzero entries. In this 
section we find the upper and lower bounds for the maximal risk of the 
proposed MAP model selector over M Po . 

Theorem 3. Let 1 < po < r, 7r(-) satisfy Assumption (P) and, in addition, 
^(po) > (po/(pe)) cpo if Po < r or 7r(r) ^ e_CT if Po = r f or some constant 
c > £(7). Then, there exists a constant C\{^) > depending only on -y such 
that 

sup E\\XP A - A/3|| 2 < C 1 ( 7 )a 2 min(p (ln(p/po) + l),r) (9) 
)9:||)9||o<po 

The general upper bound Q for the maximal risk of the MAP selector 
over M vo in Theorem [3] holds for any design matrix X. To assess its accuracy 
we establish the lower bound for the minimax risk of estimating the mean 
vector Xj3 in ([1]). 

For any given k = 1, ...,r, let 4> m i n [k] and 4> ma x[k] be the /s-sparse mini- 
mal and maximal eigenvalues of the design defined as 

6 ■ fjfcl - min 

6 \k]- max ™ 2 

(see Meinshausen & Yu, 2009; Bickel, Ritov & Tsybakov, 2009). In fact, 
(j>min[k] and (j>max[k] are respectively the minimal and maximal eigenvalues 
of all k x k submatrices of the matrix X'X generated by any k columns of 



X. Let r[k] = 4>min[k]/(i)max[k], k = 1, ...,r and set r[k] = r[r] for all k > r. 
By the definition, r\k\ is a non-increasing function of k. Obviously, r[k] < 1 
and for the orthogonal design the equality holds for all k. 

Theorem 4. Consider the model (OJ) and let 1 < po < r. There exists a 
universal constant C2 > such that 



where the infimum is taken over all estimates y of the mean vector X(3. 

Theorem H] shows that the minimax lower bound (|1U|) depends on a 
specific design matrix X only through the sparse eigenvalues ratios. A com- 
putationally simpler but less accurate minimax lower bound can be obtained 
by replacing r[2po] and r[po] in (fTUj) by r[r], that for the case r = p < n is 
just the ratio of the minimal and maximal eigenvalues of X'X. 

For the orthogonal design, where r[-] = 1, and p < n analogous results 
were obtained in Birge &: Massart (2001). For a general design and pq < r/2 
similar minimax lower bounds were independently obtained in Raskutti, 
Wainwright &: Yu (2009) for a design matrix of a full rank and in Rigollet 
&; Tsybakov (2010) for a general case within a related aggregation context. 

The established upper and lower bounds @, (|10|) for the risk of the MAP 
model selector allow us in the following section to investigate its asymptotic 
minimaxity as both n and p increase. 

5 Asymptotic adaptive minimaxity 
5.1 Nearly-orthogonal design 

In this section we consider the asymptotic properties of the MAP model 
selector as the sample size n increases. We allow p = p n to increase with n as 
well and look for a projection of the unknown mean vector on an expanding 




1 < Po < r/2 
r/2 < pq < r 
(10) 



span of predictors. In particular, the most challenging cases intensively 
studied nowadays in literature are those, where p > n or even p 3> n. In 
such asymptotic settings one should essentially consider a sequence of design 
matrices X HjPn , where r n — > oo. For simplicity of exposition, in what follows 
we omit the index n and denote X ntPn by X p emphasizing the dependence on 
the number of predictors p and let r tend to infinity. Similarly, we consider 
now sequences of coefficients vectors (3 p and priors vr p (-). In these notations, 
the original model (Q]) is transformed into a sequence of models 



where rank(X p ) = r and any r columns of X p are linearly independent 
(hence, T p [r] > 0), e ~ iV(0, (J 2 I n ) and the noise variance a 2 does not depend 
on n and p. One can also view a sequence of models (jlip in a triangular 
array setup (Greenshtein & Ritov, 2004). 

Definition 1. Consider the sequence of design matrices X p . The design is 
called nearly-orthogonal if the corresponding sequence of sparse eigenvalues 
ratios r p [r] is bounded away from zero by some constant c > 0. Otherwise, 
the design is called multicollinear. 

Nearly-orthogonality condition essentially means that there is no mul- 
ticollinearity in the design in the sense that there are no "too strong" lin- 
ear relationships within any set of r columns of X p . Intuitively, it is clear 
that in this case p cannot be "too large" relative to r and, therefore, to 
n. Indeed, apply the upper and lower bounds Q, (|10|) for po = r/2 to get 
(C2/2)<7 2 T[r]r(ln(2p/r) + 1) < C\{^)o- 2 r that implies the following remark: 

Remark 1. For nearly- orthogonal design, necessarily p = 0{r) and, there- 
fore, p = 0(n). 

The following corollary is an immediate consequence of Theorems [3] and 



y = X p {3 p + e, 




SI 



Corollary 1. Let the design be nearly-orthogonal. 



1. As r increases, the asymptotic minimax risk of estimating the mean 
vector X p (3 p over M. Po is of the order min(po(ln(f/po) + l)i r ); that 
is, there exist two constants < C\ < C2 < 00 such that for all 
sufficiently large r, 

Ci a 2 mm(p (ln(p/p ) + l),r) < inf sup E\\y - X p f3 p \\ 2 

y /3 p :||/3 p IIo<po 

< C 2 a 2 min (p (ln(j>/p ) + 1), r) 

for all 1 < po < r. 

2. Assume Assumption (P) and, in addition, thatir p (k) > (k/(pe)) Clk , k = 
1, ...,r — 1 and 7r p (r) > e~ C2T for some constants c±,C2 > 0(7). Then, 
the corresponding MAP model selector attains the minimax conver- 
gence rates simultaneously over all A4 P0 , 1 < po < r. 

One can easily verify that the conditions on the prior of Corollary [1] 
are satisfied, for example, for the truncated geometric prior (see Section [2]) 
for all k = l,...,r. The resulting MAP model selector attains, therefore, 
the minimax rates simultaneously for all M PQ , Po = r. As we have 
mentioned, the corresponding penalty in © is of the 2k ln(p/fc)-type. On the 
other hand, no truncated binomial prior B{p, £ p ) can satisfy these conditions 
on the entire range k = 1, r. It is easy to verify that they hold for small 
£ p if k <C r but for large £ p if k ~ r. In fact, these arguments go along the 
lines with the similar results of Foster & George (1994) and Birge & Massart 
(2001, 2007). Recall that binomial prior corresponds to linear penalties of 
the type Pen(k) = 2a 2 Xk in © (see Section [2]). Foster k George (1994) 
and Birge & Massart (2001, Section 5.2) showed that the best possible risk 
of such estimators over Ai Po is only of order a 2 polnp achieved for A ~ lnp 
corresponding to the RIC criterion. It is of the same order as the optimal 
risk a 2 po(ln(p/po) + 1) for po <C p (sparse case) but larger for dense case 
(Po ~ p)- On the other hand, the risk of the AIC estimator (A = 1) is of the 
order a 2 r, which is optimal for dense but much larger for sparse case. 



Furthermore, under somewhat similar nearly-orthogonality conditions, 
Bickel, Ritov & Tsybakov (2009) showed that the well-known Lasso (Tib- 
shirani, 1996) and Dantzig (Candes & Tao, 2007) estimators achieve only 
the same sub-optimal rate <J 2 po Inp as RIC. These results are, in fact, not so 
surprising since both Lasso and Dantzig estimators are essentially based on 
convex relaxations of the /o-norm of regression coefficients \ \/3\\q in the linear 
complexity penalty 2<t 2 A||/3||o in order to replace the original combinatorial 
problem ([2} by a convex program. Thus, Lasso approximates the lo- norm 
||/3||o by the the corresponding Zi-norm ||/3||i. In particular, for the orthog- 
onal design, linear complexity penalties and Lasso yield respectively hard 
and soft thresholding of components of (3 with a fixed threshold. RIC esti- 
mator and Lasso with the optimally chosen tuning parameter (e.g., Bickel, 
Ritov He Tsybakov, 2009) result in this case in the well-known hard and soft 
universal thresholding of Donoho & Johnstone (1994) with a fixed threshold 
a^/2\np which is rate-optimal for various sparse but not dense settings. On 
the other hand, the nonlinear MAP penalty corresponds to hard threshold- 
ing with a data-driven threshold that under conditions on vr p (-) in Corollary 
[T] is simultaneously minimax for both sparse and dense cases (Abramovich, 
Grinshstein &: Pensky, 2007; Abramovich et al, 2010). 

Finally, note that for the nearly-orthogonal design, \\X p (3 p £ I — X p (3 p \ \ x 
H^pA/ — PpW: where "x" means that their ratio is bounded from below and 
above. Therefore, all the results of Corollary Q] for estimating the mean 
vector Xp(3 p in (jlip can be straightforwardly applied for estimating the 
regression coefficients /3„. This equivalence, however, does not hold for the 
multicollinear design considered below. 

5.2 Multicollinear design 

Nearly-orthogonality assumption may be reasonable in the "classical" setup, 
where p is not too large relatively to n but might be questionable for the 
analysis of high-dimensional data, where p>n, due to the multicollinearity 



phenomenon (see also Remark [T]). When this assumption does not hold, the 
sparse eigenvalues ratios in (fTUj) may tend to zero as p increases and, thus, 
decrease the minimax lower bound rate relatively to the nearly-orthogonal 
design. In this case there is a gap between the rates in the lower and upper 
bounds (|10p and ([9]). Intuitively, one can think of exploiting correlations 
between predictors to reduce the size of a model (hence, to decrease the 
variance) without paying much extra price in the bias term, and, therefore, 
to reduce the risk. We show that under certain additional assumptions on 
the design and the coefficients vector in (jlip . the upper risk bound ([9|) can 
be indeed reduced to the minimax lower bound rate (|10|) . 

For simplicity of exposition we consider the sparse casepo <r/2 although 
the corresponding conditions for the dense case r/2 < po < r can be obtained 
in a similar way with necessary changes. 

We introduce now several definitions that will be used in the sequel 
(including the proofs in the Appendix). For a given index set J and I > |j7"| 
define a / x matrix Gij which columns ej, j E J are the elements of 
the standard basis in M. 1 . Thus, for any matrix A with I columns, AG\^j 
selects the columns of A indexed by J . Similarly, for any I x / symmetric 
matrix A, G\ jAGi t j generates a (symmetric) |j7"| x \J\ submatrix of A of 
the corresponding columns and rows. 

For all k = 1, ...,r/2, define k' = \r p [2k] ■ k~\ > 1. Let Jm be an index 
set of predictors included in a model M of size k = \Jm\- For any submodel 
M' C M of size k' < k let 4>m,m' be the minimal eigenvalue of the (k — 
k')x(k-k') matrix A p (M,M') = ^j^XG'^X'^G^r^j^,. 
In fact, <7 2 Ap(M, M') is the covariance matrix of the components of the 
least squares estimate vector (3 M corresponding to a subset of predictors in 
M/M'. 

Finally, define 

4>p[k] = min max 4>m M' 

M:\M\=kM'cM:\M'\=k' 

As we show later (see the proof of Theorem in the Appendix), 0r 1 [fc] 



measures an error of approximating mean vectors X p (3 p , where ||/9 p ||o = 
k, by their projections on lower dimensional subspans of predictors. The 
stronger is multicollinearity, the better is the approximation and the larger 
is (f> p [k\. 

Theorem 5. Let T p [r] — > as r — > oo (multicollinear design). Assume the 
following additional assumptions on the design matrix X p and the (unknown) 
vector of coefficients (3 p in 177]) : 

(D) for all p there exist 1 < k p \ < k P 2 < r/2 such that 

1. ci < r p [2fc] • k < k — 1, fe = /c p i, n p2 

2. t p [2k p2 ] > (K p2 /{pe)Y* 

3. <f>p,min[2k] ■ 4> p [k] > C 3 , k = K p x, K p2 

( B ) \\P P \\ 2 oo< c 4 T p [2p ] ■ 4> p {p ] ■ (ln(p/p ) + 1), where p = \\P p \\o 

for some positive constants c%, c 2 , £3 and C4. 

Then, under the above additional restrictions, if the prior 7T P (-) satis- 
fies Assumption (P) and for all k = k p i, K p2 , ir p (k') > (k'/(pe)) ck for 
some positive c > 0(7), where k' = [r p [2fc] • k~\, the corresponding MAP 
model selector is asymptotically simultaneously minimax (up to a constant 
multiplier) over all M. po , k p ± < po < K p2 . 

Note that by simple algebra one can verify that 4> p>m in[2k] ■ (j> p [k] < 1 
and, therefore, the constant C3 in Assumption (D.3) is not larger than one. 

We have argued that multicollinearity typically arises when p^> n. One 
can easily verify that for n = 0(p a ), < a < 1, Assumption (D.2) always 
follows from Assumption (D.l) and, therefore, can be omitted in this case. 

As we show in the proof, Assumptions (D.l, D.2) and Assumption (B) 
allow one to reduce the upper bound ([9]) for the risk of the MAP model se- 
lector by the factor r p [2po], while Assumption (D.3) is required to guarantee 
that the additional constraint on j3 in Assumption (B) does not affect the 
lower bound (fTUjh 



To obtain asymptotic minimaxity of the MAP selector within the en- 
tire range 1 < po < r/2 similar to Corollary [T] for the nearly-orthogonal 
case, Assumptions (D) on the design matrix are required to be satisfied for 
all k = l,...,r/2 that might be quite restrictive. However, the results of 
Theorem [5] are more general and show the tradeoff between relaxation of 
Assumptions (D) to a smaller range of k and the corresponding constriction 
of the adaptivity range for pq. 

6 Computational aspects 

In practice, minimizing ([2]) (and ([5]) in particular) requires generally an NP- 
hard combinatorial search over all possible models. During the last decade 
there have been substantial efforts to develop various approximated algo- 
rithms for solving ([2]) that are computationally feasible for high-dimensional 
data (see, e.g. Tropp & Wright, 2010 for a survey and references therein). 
The common remedies involve either greedy algorithms (e.g., forward se- 
lection, matching pursuit) approximating the global solution by a stepwise 
sequence of local ones, or convex relaxation methods replacing the original 
combinatorial problem by a related convex program (e.g., Lasso and Dantzig 
selector for linear penalties). The proposed Bayesian formalism allows one 
instead to use a stochastic search variable selection (SSVS) techniques orig- 
inated in George & McCulloch (1993, 1997) for solving ([S]) by generating 
a sequence of models from the posterior distribution P(M\y) in @. The 
key point is that the relevant models with the highest posterior probabilities 
will appear most frequently and can be identified even for a generated sam- 
ple of a relatively small size avoiding computations of the entire posterior 
distribution. 

The SSVS algorithm for the problem at hand can be basically described 
as follows. As we have mentioned in Section [21 every model M is uniquely 
defined by the corresponding indicator vector Am and the joint posterior 
distribution of djvf is given by (H|) (up to a normalizing constant). SSVS 



uses the Gibbs sampler to generate a sequence of indicator vectors di, d m 
componentwise by sampling consecutively from the conditional distributions 
4j'|( d (-j),y) ,j = 1, ■••,£>, where d ( _ i} = (di, dj-i, d j+i , dp)'. The 
components dj can be trivially obtained as simulations of Bernoulli draws, 
where from @ the corresponding posterior odds ratio 



and ARSSj > is the increment in the residual sum of squares (RSS) 



resulting Gibbs sampler is computationally efficient and, as m increases, 
the empirical distribution of the generated sample converges to the actual 
posterior distribution of d|y. After the sequence has reached approximate 
stationarity, one can identify the most frequently appeared vector(s) d as 
potential candidate(s) to solve (|5|). 

7 Concluding remarks 

In this paper we considered a Bayesian approach to model selection in Gaus- 
sian linear regression. From a frequentist view, the resulting MAP model 
selector is a penalized least squares estimator with a complexity penalty 
associated with a prior ir(-) on the model size. Although the proposed es- 
timator was originated within Bayesian framework, the latter was used as 
a natural tool to obtain a wide class of penalized least squares estimators 
with various complexity penalties. Thus, we believe that the main take- away 
messages of the paper summarized below are of a more general interest. 

The first main take-away message is that neither linear complexity penal- 
ties (e.g., AIC, BIC and RIC) corresponding to binomial priors vr(-), nor 
closely related Lasso and Dantzig estimators can be simultaneously mini- 
max for both sparse and dense cases. We specify the class of priors and 



P^lld^y) 
P^Old^y) 



Pjd^ = l,d ( _ ]) \y) 




after dropping the j-th predictor from the model (dj = l,d(_j-))'. The 



associated nonlinear penalties that do yield such a wide adaptivity range. 
In particular, it includes 2fcln(p/fe)-type penalties. 

Another important take-away message is about the effect of multicollinear- 
ity of design. Unlike model identification or coefficients estimation, where 
multicollinearity is a "curse", it may become a "blessing" for estimating 
the mean vector allowing one to exploit correlations between predictors to 
reduce the size of a model (hence, to decrease the variance) without pay- 
ing much extra price in the bias term. Interestingly, a similar phenomenon 
occurs in a testing setup (e.g., Hall & Jin, 2010). 
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8 Appendix 

Throughout the proofs we use C to denote a generic positive constant, not 
necessarily the same each time it is used, even within a single equation. 



8.1 Proof of Theorem [T] 

Define 

Lu = M /AAVn I / 



L fe = (!/&) In I (f V" 1 ^)] >c(7), k = l,...,r-l (12) 



and 

L r = (l/r)ln7r _1 (r) > c(i) (13) 

In terms of Lfc the complexity penalty ©-(13) is Penik) = a 2 {l+l/^)k{2L] t + 
ln(l+7)). Following the arguments of the proof of Theorem 1 of Abramovich 
et al. (2007), under the Assumption (P) one has 

E ( P k ) e ~ kLk + e ~ rLr = E = 1 - <°) < 1 

k=l ^ ' k=l 

and 

(1 + l/ 7 )(2L fc + ln(l + 7)) > C(l + yj2T k )\ k = 1, r 

The proof of Theorem Q] then follows directly from Theorem 2 of Birge & 

Massart (2001). 

□ 



8.2 Proof of Theorem H 



Let L* = maxo<fc< r Lfc, where fe = 1, ...,r were defined in (|12|) -(|13|) 
and Lq = 21n7r _1 (0). Simple calculus shows that the conditions on 7r(-) in 
Theorem [2] imply L* = 0(lnj>). 

Consider first the case k > 1. From Theorem [1] we have 

E\\Xp £l -Xf3\\ 2 < C o(7)inf{||X/3 M -X/3|| 2 + a 2 (l + l/ 7 )|Af|(2L| A/ |+ln(l + 7 ))}+ Cl ( 7 )a 2 

< co( 7 )(l + 1/ 7 )(2L* + ln(l + 7 )) inf {\\X/3 M - X (3\\ 2 + \M\a 2 } + Cl ( 7 )a 2 

Ad 

< c 2 (7) (2L* + ln( 1 + 7) ) | mf E\ \ Xp M - X/3 \ | 2 + a 2 \ 

(14) 

For the degenerative case k = (M = {0}), Theorem [1] implies 

EWXP^-XPW 2 < co(7)(||X / 3|| 2 + a 2 (l + l/7)Lo)+ci(7y 2 

< c 2 ( 7 )L*(\\X(3\\ 2 + a 2 ) (15) 

Combining (fll"]) and (fT5|) completes the proof. 
□ 

8.3 Proof of Theorem H 

For all po < r, Theorem Q] and ([7]) under the assumption ir(r) > e _cr imply 

sup E||X/3 a7 -A73|| 2 < sup -X/3| | 2 <Pen(r) + Cl (j)a 2 

/3:||/3|| <PO /3:||/3||o<r 

< C 1 ( 7 )^ 2 r (16) 

On the other hand, applying the general upper bound for the risk of 
MAP model selector established in Theorem Q] for models of size po < r we 
have 

sup | \XP£-XP\ | 2 < co(7)2a 2 (l+l/ 7 ) ( In j ( P ) tt" 1 ( Po ) } + ^ ln(l + 7)) +d (7)^ 

/3:||/3|| < W V IW J 2 ; 

(17) 



Abramovich et al. (2010, Lemma 1) showed that (J^) < (pe/po) Po . Hence, 
under the conditions on tt(po) in Theorem [3l for po = 1, ...,r — 1, (|17p yields 

sup EWXp^-XPW 2 < C o(7)2a 2 (l + l/ 7 ){(c + l)poln(WPo) + vln(l + 7)} + ci(7)^ 2 
P--\\P\\o<Po 1 2 J 

< Ci(j)a 2 po(ln(p/po) + l) 

Finally, note that for p$ = r, as we have already established in (|16p . 

sup E\\X0 A -X0\\ 2 < Ci( 7 )a 2 r < Ci( 7 )a 2 r(ln(p/r) + 1) 
/3:||/3|| <r 

□ 

8.4 Proof of Theorem [4] 

The core of the proof is to find a subset B po of vectors /3, where ||/3||o < Po> 
and the corresponding subset of mean vectors Q po = {g£ M n : g = X{3, (3 € 
B PQ } such that for any gi, g2 S £/ Po , ||gi — g2|| 2 > 4s 2 (po) and the Kullback- 
Leibler divergence if(P gl , P g2 ) = l|gl 2 ~f a|12 < (1/16) lncard(</ po ). Lemma 
A.l of Bunea, Tsybakov & Wegkamp (2007) will imply then that s 2 (po) is 
the minimax lower bound over A4 po . 

To construct the desired subsets B Po and Q Po we consider three possible 
cases. 

Case 1. p < r/2 

Define the subset B Po of all vectors /3 6 MP that have po entries equal to 
C po defined later, while the remaining entries are zeros: B po = {(3 : (3 € 
{{0,C PO } P }, \\/3\\o = po}. For p < r/2 from Lemma 8.3 of Rigollet & 
Tsybakov (2010), there exists a subset B Po C B Po such that for some constant 
c > 0, lncard(£> Po ) > cpo(hi(p/po) + l)j and for any pair (3 t , (3 2 € £> po , the 
Hamming distance p(f3 1 ,(3 2 ) = Y%=iHPij ^ P23} > cf>o- 

Consider the corresponding subset of mean vectors Q Po , where card(£/ Po ) = 
card(£> po ). For any gi, g2 € G P0 and the corresponding (3 1 , (3 2 £ &po we 



then have 



||gi - g 2 || 2 = 11*09! - (3 2 )\\ 2 > <t>min[2pa] WPx - (3 2 \\ 2 > c<p min [2 Po ]C 2 Po 

(18) 

On the other hand, by similar arguments, the Kullback-Leibler diver- 
gence satisfies 

[2p \ClpiP 1 M <p max [2 Po ]C 2 p 

A l F gl' r g2i S - ^2 

SetnowC 2 = {l/16)a 2 c(ln(p/p )+l)/^ m a X [2p } ands 2 (p ) = (l/64)a 2 c 2 r[2p ]po(ln(p/p )+ 
1). Then, (HID and flS) yield | |gi-g 2 | | 2 > 4s 2 (p ), K(¥ gl ,¥ S2 ) < (1/16) In card(g po ), 
and Lemma A.l of Bunea, Tsybakov & Wegkamp (2007) completes the 
proof. 

Case 2. r/2 < p < r, p > 8 

In this case consider the subset B po = {(3 G W : (3 £ {{0, C po } Po , 0, 0} 
and apply Varshamov-Gilbert bound (see, e.g. Tsybakov, 2009, Lemma 2.9). 
It guarantees the existence of a subset B Po C B Po such that lncard(£> Po ) > 
(po/8) In 2 and the Hamming distance p(/3 1 , (3 2 ) > Po/8 for any pair /3 l5 /3 2 6 
Bpo- 

Note also that for any f3 2 £ £> P() , /3! — /3 2 has at most po non-zero 
componens and repeating the arguments for the Case 1, one achieves the 
minimax lower bound s 2 (po) = Co" 2 r[po]po > (C /2)o 2 T[po\r. 

Case 3. r/2 < p < r, 2 < p < 8 

For this case, obviously, 2 < r < 16. Consider a trivial subset B Po containing 
just two vectors j3 1 = and (3 2 that has po nonzero entries equal to C po = 
(1/64) u 2 In 2/(p max [po] . For the corresponding mean vectors gi = X(3 l = 
and g2 = */3 2 , following (fTHj) and (fT9j) one has 

^ r H8C 2 
:<r 2 " 

and 



K(F S1 ,F S2 ) < "'"^ P0 = (l/16)lncard(g po ; 



|gi - g2 1 1 2 > 4>min\po]PoC po = Ca 2 T\p ]p > (C/2)a 2 T [p ]r 



Applying Lemma A.l of Bunea, Tsybakov & Wegkamp (2007) completes 
the proof. 

□ 

8.5 Proof of Theorem [5] 

We want to show that under the conditions of Theorem we can reduce 
the rate in the upper bound ([9]) for the risk of the MAP model selector 
established in Theorem [3] over Ai Po by the factor r p [2po]- Recall that we 
derived ([9]) from the general upper bound dH) in Theorem Q] by considering 
models M of size po- Consider now 1 < k p i < po < n P 2 <r/2 and apply 
for models M' of less size p' Q = |~r p [2po]j>o] > 1. Consider an arbitrary (3 p 
with ||/3 p ||o = Po- Under the conditions of Theorem [5] on the prior vr p (-), flSJ) 
implies 

E\\X p P A y p -X p (3 p \\ 2 < Ci( 7 )inf ||Z p /3 p -X p /3 pM ,|| 2 +C 2 ( 7 )aVo(ln(pM)+l), 

(20) 

where X p j3 pM , is the projection of the mean vector X p (3 p on the span of 
M' . Comparing @ and (|20p illustrates that reduction of a model size intro- 
duces the bias. On the other hand, under Assumptions (D.l) and (D.2), a 
straightforward calculus shows then that the variance term decreases to the 
desired order poT p [2po](ln(p / 'po) + 1). The idea of the proof will be based on 
finding a model such that the resulting bias term will be at most of the 
same order as the reduced variance. 

Consider the model M of size po corresponding to j3 p and any of its 
submodels M' of size p' Q defined above. Then, M/M' is evidently a sub- 
set of predictors from M not included in M' and Dm/m' = Dm — Dm 1 , 
where diagonal indicator matrices D's were introduced in Section [2j By 



straightforward calculus one then has 



\\XpP p — X p (3 pM ,\\ 2 — (3' p D m / ai ,(D ai / m ,(DmX p X p Dm) + Dm/m') + Dm/m'P p 

< ^yii/^iiLbo-Po), (21) 

where the matrices G and A and the minimal eigenvalue 4>m,M' were defined 
in Section 15.21 

Among all submodels M' C M of size p' , choose with the maximal 
4>M,M'- Then, 4>m,M', — fiplPo] an< ^ (EU) an d Assumption (B) yield 

\\X p f3 p - X p f3 pM ,\\ 2 < CT p [2 Po ]p \n((p/p ) + 1) 

Hence, we proved that under assumptions on the prior, Assumptions (D.l, 
D.2) and (B), the upper bound for the risk of the MAP model selector 
over M po is of the minimax order r p [2po]po m GVR>) + 1)- Assumption (D.3) 
guarantees that the "least-favorable" sets B po constructed in the proof of 
Theorem U] satisfy the additional Assumption (B) on /3 and, therefore, the 
minimax lower bound ([10]) is not reduced. 
□ 



